Genetic background of Polygenic Disease by MCMC

Authors

  • Alexander V. Favorov
  • Timophey V. Andreewski
  • Marina A. Sudomoina
  • Olga O. Favorova
  • Giovanni Parmigiani
  • Michael F. Ochs
Abstract

In recent years, the number of studies focusing on the genetic basis of common disorders with a complex mode of inheritance, in which multiple genes of small effect are involved, has been steadily increasing. An improved methodology to identify the cumulative contribution of several polymorphous genes would accelerate our understanding of their importance in disease susceptibility and our ability to develop new treatments. A critical bottleneck is the inability of standard statistical approaches, developed for relatively modest predictor sets, to achieve power in the face of the enormous growth in our knowledge of genomics. The inability is due to the combinatorial complexity arising in searches for multiple interacting genes. Similar “curse of dimensionality” problems have arisen in other fields, and Bayesian statistical approaches coupled to Markov chain Monte Carlo (MCMC) techniques have led to significant improvements in understanding. We present here an algorithm, APSampler, for the exploration of potential combinations of allelic variations positively or negatively associated with a disease or with a phenotype. The algorithm relies on the rank comparison of phenotype for individuals with and without specific patterns (i.e., combinations of allelic variants) isolated in genetic backgrounds matched for the remaining significant patterns. It constructs a Markov chain to sample only potentially significant variants, minimizing the potential of large data sets to overwhelm the search. We tested APSampler on a simulated data set and on a case-control study of multiple sclerosis (MS) in ethnic Russians. For the simulated data, the algorithm identified all the phenotype-associated allele combinations coded into the data and, for the MS data, it replicated the previously known findings.

It is generally accepted now that genetic susceptibility to diseases with a complex mode of inheritance is explained by the presence of multiple genes, each conferring a small contribution to the overall risk (TABOR et al. 2002). The complexity increases because similar disease-prone phenotypes may be produced by different genes in the same pathways as well as by alternative sets of genes providing disease heterogeneity. Due to the success of the human genome project (MCPHERSON et al. 2001; VENTER et al. 2001) and the development of high-throughput sequencing and genotyping technologies (THE INTERNATIONAL HAPMAP CONSORTIUM 2003; SHERRY et al. 2001), there has been a rapid increase in the availability of genetic data for numerous polymorphous loci, including SNPs, repeat polymorphisms and insertions/deletions. This allows the collection of large sets of genetic data, which could be key in the dissection of the genetic basis of complex diseases.

Standard analytical approaches developed for simple etiologies present problems when dealing with complex etiologies involving multiple genes (THORNTON-WELLS et al. 2004). An approach that has shown great promise in areas with similar dimensionality problems is Markov chain Monte Carlo (MCMC) exploration using a Bayesian statistical basis (GILKS et al. 1996). Bayesian methods use the MCMC technique to make inferences that take into account a study’s data, as well as additional independent information. For instance, if genes were known to be in linkage disequilibrium, a measurement on the variant of one would provide information on the second, whether it was measured or not. Such information could be included through a prior probability distribution. In general, the final inference is represented by a posterior probability distribution, which includes information from the likelihood, derived from the fit of a model to the data, and prior knowledge of the subject encoded in the prior distribution.
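The relation between posterior, likelihood and prior described above can be written compactly with Bayes' theorem. The symbols here are illustrative assumptions, not notation from the text: $\theta$ stands for the unknown quantities (for example, a set of patterns) and $D$ for the study data.

```latex
p(\theta \mid D)
  = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, d\theta'}
  \;\propto\; p(D \mid \theta)\, p(\theta)
```

MCMC methods typically need the posterior only up to the normalizing constant in the denominator, which is one reason they remain usable when the space of $\theta$ is too large to enumerate.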
In statistical genetics, Bayesian approaches have become popular in recent years as computational power has increased to a point where these methods can be fully utilized. In addition, the completion of the human genome project has provided a substantial body of information on gene locations, potential linkages, and SNPs, which are often best incorporated in an analysis by Bayesian approaches (RANNALA 2001). There are numerous recent examples of the application of Bayesian methods in genetics that include population studies, quantitative trait loci mapping, and family-based studies (reviewed in BEAUMONT and RANNALA 2004).

While the analysis of models with potentially complex interaction is not new to statistics and artificial intelligence, the complexity and size of the data analyses we currently face cannot be efficiently tackled with existing methods. In special settings, such as case-control and discordant sib-pair studies with a moderate number of alleles, exhaustive pattern searches can be conducted using multifactor dimensionality reduction (HAHN et al. 2003). This method has been effective in identifying a four-way interaction among alleles, but the method is not highly scalable, and one can only consider one pattern, albeit complex, at any given time. Larger model spaces can be explored using statistical model search procedures such as stochastic search variable selection (GEORGE and MCCULLOCH 1993). These require a substantial computational effort and often rely on model assumptions that are difficult to test. Recursive partitioning methods are also commonly used to investigate complex interactions. One example is logic regression (KOOPERBERG and RUCZINSKI 2005; KOOPERBERG et al. 2001), which can search for multiple patterns, each including interactions.
However, most recursive partitioning approaches have difficulty identifying complex interactions between predictors that do not show significant main effects, a critical feature of epistasis.

Our approach to surmount these obstacles can be outlined as follows. We are interested in searching over a space of candidate pattern sets, in which each pattern can be a complex genotypic pattern with multiple alleles involved. Evaluation of each of the possible candidates is not feasible for realistic problems because of the number of alleles typed. This suggests a stochastic search approach using MCMC techniques (GILKS et al. 1996; LIU 2001; ROBERT and CASELLA 1999). Implementation requires an a posteriori distribution reflecting the strength of the evidence provided by the data in favor of an association between each pattern included in the pattern set and the phenotype. Our approach is based on a practical approximation to such a posterior, built upon the distribution of a statistic for the nonparametric evaluation of the null hypothesis of no association between the patterns and phenotype. We deal with the confounding of the patterns by a procedure that is the equivalent of a statistical adjustment and that we term “pattern isolation”. We say that a pattern is isolated from some other patterns if we remove the influence of all these other patterns on the trait level before we consider its association with the level. The algorithm is intended to identify sets of patterns that are associated with the trait when considered in mutual isolation.

METHODS

Overview: The allelic patterns we seek are of interest in complex genetic diseases and include multiple alleles that are associated with a trait in combination rather than individually.
We consider the general situation in which we have, for each individual, both a list of typed alleles at a fixed set of candidate loci and the phenotype of interest. Our method is based on ranks, so the phenotype can be measured as a continuous variable or as an ordinal categorical variable. While quantitative phenotypic measurements are powerful when available, it is useful in many applications to have a more general methodology that only requires comparing individuals to each other, as is the case with ranks.

Our approach is designed to search for correlations between complex genetic patterns and phenotype. These correlations are captured via differences in the distributions of phenotype across two subsets of the population, defined by whether a certain allelic pattern is present or not. We consider a broad range of possible genetic models by allowing every allele to potentially affect the phenotype irrespective of its counterpart on the other chromosome. For example, our approach covers dominant and recessive models, as well as their combinations.

When looking for polygenic disease patterns, an important challenge arises from the fact that it is not sufficient to consider candidate patterns one by one, because one pattern may confound the measurement of association for another. Thus, we seek a set of patterns. While we do not consider explicitly the issue of removing the possible influence of environmental factors on the phenotype, such a generalization is possible by modifying the test statistic used to construct the likelihood.

Data structure: The typical raw data structure to which our algorithm applies is represented in Table 1, where each row corresponds to an individual. Measurements include a phenotypic variable and the results of genotyping a set of loci on the genome.
While these would generally be SNPs, genotypes arising from the sequencing of genes or chromosomal regions would produce appropriate data as well. We set no limit to the number of different alleles that can be observed at a locus in the data set and assume that data are available for the two chromosomes at each locus, although we do not distinguish the two chromosomes presently. If we do not have information about an allele, we denote this with a zero in one of the two locations defining the locus.

Allelic patterns: An allelic pattern is defined here as follows. If there are $L$ loci, a pattern is a $2L$-dimensional vector. Each entry corresponds to a locus-chromosome combination. Each value is either a label for a specific allele or a 0, if the variant is irrelevant for the phenotype. Patterns are illustrated in Table 2. We set no limit to the number of loci that can be involved in a pattern.

Patterns in a set can be independently contributing to the phenotype or may act in concert. To account for this possibility we consider pattern sets, which are collections of patterns. Patterns will be indexed by $n$ and pattern sets by $s$. The total number of patterns is $N$, and the total number of pattern sets is therefore $S = 2^N$. To keep the computation manageable, we will restrict the search to pattern sets with a fixed number of patterns. The number of loci involved in a single pattern controls the order of interaction among loci. The number of patterns in a set controls the number of genetic effects that need to be simultaneously considered to avoid masking and confounding effects.

To search for pattern sets it is useful to define a data structure, called the pattern presence matrix, indicating whether a certain pattern is present or absent in each individual. This is illustrated in Table 3 and will be the basic data structure used in the algorithm.
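The following is a minimal sketch of how a pattern presence matrix might be built from genotype data of the kind described above. It is not the APSampler implementation. The names `pattern_present` and `presence_matrix` are hypothetical, and the matching rule is an assumption: the two alleles at a locus are compared as an unordered pair, and presence is treated as undetermined (`None`) when an untyped allele, coded 0, could still supply a required match.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

# One locus = two allele labels. In genotype data 0 means "allele not typed";
# in a pattern 0 means "irrelevant for the phenotype".
Locus = Tuple[int, int]


def pattern_present(genotype: Sequence[Locus],
                    pattern: Sequence[Locus]) -> Optional[bool]:
    """Return True/False when presence is decidable, None when it is not."""
    undecided = False
    for observed, required in zip(genotype, pattern):
        need = Counter(a for a in required if a != 0)
        if not need:
            continue  # this locus plays no role in the pattern
        have = Counter(a for a in observed if a != 0)
        untyped = sum(1 for a in observed if a == 0)
        # Required alleles not covered by the typed alleles (assumed rule:
        # chromosomes are not distinguished, so order does not matter).
        shortfall = sum((need - have).values())
        if shortfall == 0:
            continue
        if shortfall <= untyped:
            undecided = True  # an untyped allele might complete the match
        else:
            return False  # the pattern is definitely absent at this locus
    return None if undecided else True


def presence_matrix(genotypes: Sequence[Sequence[Locus]],
                    patterns: Sequence[Sequence[Locus]]
                    ) -> List[List[Optional[bool]]]:
    """Rows are individuals, columns are patterns."""
    return [[pattern_present(g, p) for p in patterns] for g in genotypes]


# Toy data: three individuals typed at two loci (illustrative values only).
genotypes = [
    ((1, 2), (3, 3)),
    ((2, 2), (3, 4)),
    ((1, 0), (4, 4)),  # second allele at the first locus was not typed
]
patterns = [
    ((1, 0), (0, 0)),  # at least one copy of allele 1 at the first locus
    ((0, 0), (3, 3)),  # two copies of allele 3 at the second locus
    ((1, 1), (0, 0)),  # two copies of allele 1 at the first locus
]
matrix = presence_matrix(genotypes, patterns)
```

In this toy example the third individual has an undetermined entry for the third pattern, which corresponds to the case where that individual would be left out when that pattern is considered.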
We use the notation $y_i$ for the phenotype of individual $i$, and $x_{in}$ for entry $(i, n)$ of the pattern presence matrix, indicating whether pattern $n$ is present in individual $i$. The symbols $y$ and $x_n$ without further subscripts will represent the corresponding random variables. If we do not know the value of $x_{in}$, because we did not obtain the necessary genotypic information concerning individual $i$, we omit this individual and the corresponding row in the presence matrix when considering that pattern. Such an individual is included in calculations for other patterns if the genotyping information allows the determination of whether the individual carries that pattern.

Pattern level comparisons: In our approach, the fundamental comparison (henceforth the “atomic” comparison) is between two groups of individuals whose presence matrix rows differ only in one column, i.e., differ only by the presence or absence of a single pattern. This comparison brings about the concept of mutual isolation of the patterns. Geometrically, we could represent all the $2^N$ pattern configurations in which an individual may fall by vertices of a unit hypercube of dimension $N$ (see Figure 1). Any pair of vertices that differ only by the presence of a single pattern is connected by an edge. All parallel edges of the hypercube correspond to the same pattern difference between sets. An atomic comparison is a comparison of two adjacent configurations on the same edge of the hypercube. The generic edge will be denoted by $e$ and the pattern that is different between the nodes connected by the edge by $n(e)$. The set of all edges associated with pattern $n$ is denoted by $E_n$, and it includes $2^{N-1}$ elements, each corresponding to a configuration of all patterns other than $n$. Statistically, an atomic comparison is a conditional comparison, while a comparison of all parallel edges at once would be a marginal comparison.
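The hypercube picture can be sketched in a few lines of code. The functions `edges_for_pattern` and `atomic_groups` are hypothetical names, patterns are indexed from 0 here, and the grouping assumes every presence entry is known (0 or 1). Handling of unknown entries is left out.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterator, List, Sequence, Tuple

Vertex = Tuple[int, ...]


def edges_for_pattern(n_patterns: int, n: int) -> Iterator[Tuple[Vertex, Vertex]]:
    """Yield every edge in E_n as (vertex with x_n = 0, vertex with x_n = 1)."""
    for rest in product((0, 1), repeat=n_patterns - 1):
        low = rest[:n] + (0,) + rest[n:]
        high = rest[:n] + (1,) + rest[n:]
        yield low, high


def atomic_groups(presence_rows: Sequence[Sequence[int]],
                  phenotypes: Sequence[float],
                  n: int) -> Dict[Vertex, Tuple[List[float], List[float]]]:
    """Split phenotypes by edge of pattern n.

    Keys are configurations of all patterns other than n. Each value holds the
    phenotypes of individuals without pattern n, then of those with it.
    """
    groups: Dict[Vertex, Tuple[List[float], List[float]]] = defaultdict(
        lambda: ([], []))
    for row, y in zip(presence_rows, phenotypes):
        rest = tuple(row[:n]) + tuple(row[n + 1:])
        groups[rest][row[n]].append(y)
    return dict(groups)


# With N = 3 patterns, pattern 1 has 2**(3 - 1) = 4 parallel edges.
edges = list(edges_for_pattern(3, 1))

# Toy presence rows for N = 2 patterns and made-up phenotype values.
rows = [(0, 0), (1, 0), (1, 0), (0, 1), (1, 1)]
ys = [1.0, 2.0, 3.0, 4.0, 5.0]
groups = atomic_groups(rows, ys, n=0)
```

Each key of `groups` corresponds to one edge, so comparing the two lists stored under a key is one atomic (conditional) comparison, whereas pooling the lists across keys would give the marginal comparison.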
Pattern $n(e)$ is associated with the phenotype, conditional on the particular configuration implied by $e$, if the two probability distributions, $p(y \mid x_1, \ldots, x_n = 0, \ldots, x_N)$ and $p(y \mid x_1, \ldots, x_n = 1, \ldots, x_N)$, differ. In particular, we say that the pattern is conditionally positively (negatively) associated with the phenotype if the distribution $p(y \mid x_1, \ldots, x_n = 1, \ldots, x_N)$ is stochastically larger (smaller) (PRATT and GIBBONS 1981) than $p(y \mid x_1, \ldots, x_n = 0, \ldots, x_N)$, and we say that the pattern is not associated if the two distributions are the same. To represent this association, we define $\alpha_e$ as follows:

$$
\alpha_e =
\begin{cases}
+1 & \text{if } n(e) \text{ is conditionally positively associated with the phenotype,} \\
-1 & \text{if } n(e) \text{ is conditionally negatively associated with the phenotype,} \\
\phantom{+}0 & \text{if } n(e) \text{ is not conditionally associated with the phenotype.}
\end{cases}
$$
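The quantity $\alpha_e$ is an unknown property of the underlying distributions. As a rough illustration of how a rank-based comparison on one edge could indicate its sign, the sketch below computes the Mann-Whitney "probability of superiority" for the two groups. This choice of statistic, the function names, and the `margin` threshold are all assumptions made for illustration. They are not taken from the text, and the result is a crude descriptive proxy, not a significance test.

```python
from typing import Sequence


def superiority(y_absent: Sequence[float], y_present: Sequence[float]) -> float:
    """Estimate P(Y1 > Y0) + 0.5 * P(Y1 = Y0) from all cross-group pairs.

    Y1 is the phenotype with the pattern present, Y0 with it absent. The value
    equals the Mann-Whitney U statistic divided by the number of pairs.
    """
    wins = 0.0
    for a in y_present:
        for b in y_absent:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(y_present) * len(y_absent))


def empirical_direction(y_absent: Sequence[float],
                        y_present: Sequence[float],
                        margin: float = 0.1) -> int:
    """Return +1, -1 or 0 as a descriptive stand-in for the sign of alpha_e.

    `margin` is a hypothetical tolerance around 0.5, not a calibrated cutoff.
    """
    s = superiority(y_absent, y_present)
    if s > 0.5 + margin:
        return 1
    if s < 0.5 - margin:
        return -1
    return 0


up = empirical_direction([1, 2, 3], [4, 5, 6])
down = empirical_direction([4, 5], [1, 2])
flat = empirical_direction([1, 2], [1, 2])
```

Only the ordering of phenotype values enters the calculation, so the same code applies to continuous and to ordinal phenotypes, in keeping with the rank-based design described earlier.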

Publication date: 2005